This project uses R Markdown to examine and visualize data on public school enrollment in the United States from the 1970s to the early 2000s. Acquired from the U.S. Census Bureau, the data covers enrollment at the national, state, and county levels. Using multiple packages, I will manipulate the original data sets by creating variables and functions to reshape, modify, and plot the data provided. The techniques used to achieve these goals come from lectures, assignments, and resources provided by ST 558: Data Science for Statisticians at North Carolina State University in Fall 2022.
The data set read in below is one section of public school enrollment
data. The readr package is required to read the original
comma-delimited .csv file into an object, or data structure, that R
can easily use. The resulting object is a tibble named
sheet1.
library(readr)
sheet1 <- read_csv("https://www4.stat.ncsu.edu/~online/datasets/EDU01a.csv")
## Rows: 3198 Columns: 42
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (22): Area_name, STCOU, EDU010187N1, EDU010187N2, EDU010188N1, EDU010188...
## dbl (20): EDU010187F, EDU010187D, EDU010188F, EDU010188D, EDU010189F, EDU010...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Running the code chunk below creates a new tibble object named
enrollment1 containing only the Area_name
column (renamed here as area_name), STCOU, and
all columns ending in “D” from sheet1. Loading the
tidyverse package makes the pipe operator (%>%) used for
chaining available.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ dplyr 1.0.9
## ✔ tibble 3.1.8 ✔ stringr 1.4.1
## ✔ tidyr 1.2.0 ✔ forcats 0.5.2
## ✔ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
enrollment1 <- sheet1 %>%
select(Area_name, STCOU, ends_with("D")) %>%
rename("area_name" = Area_name)
enrollment1
Running the code chunk below reshapes the enrollment1
tibble from wide to long format. The resulting object is a long format
tibble named enrollment2 where each row, or observation,
has only one enrollment value (studentsEnrolled) for the
area_name. The column names ending in “D” from
enrollment1 are now under the measurementCode
variable.
enrollment2 <- enrollment1 %>%
pivot_longer(cols = 3:12, names_to = "measurementCode", values_to = "studentsEnrolled")
enrollment2
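As a small illustration of the wide-to-long reshape described above, the sketch below applies pivot_longer() to a made-up tibble. The object names toy_wide and toy_long and all values are assumptions for illustration only. It selects the columns with ends_with("D") rather than by position, which is one possible alternative that does not depend on column order.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: two areas and two "D" columns (values are made up)
toy_wide <- tibble(
  area_name  = c("Area A", "Area B"),
  STCOU      = c("00001", "00002"),
  EDU010187D = c(100, 200),
  EDU010188D = c(110, 210)
)

# Reshape so that each row holds exactly one enrollment value
toy_long <- toy_wide %>%
  pivot_longer(cols = ends_with("D"),
               names_to = "measurementCode",
               values_to = "studentsEnrolled")

toy_long
```

Each area contributes one row per "D" column, so the two-row wide tibble becomes a four-row long tibble.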
Running the code chunk below manipulates the enrollment data further
by splitting the measurementCode variable into two separate
columns. The mutate() function is used to create the new
variables and the substr() function parses the character
string values from enrollment2’s
measurementCode into a seven character
measurementType and a two character year. The
year variable is then converted to a two digit numeric
variable. This step is required to change the two digit
year into a four digit schoolYear. All
unnecessary columns are then dropped from the new tibble named
enrollment4.
enrollment3 <- enrollment2 %>%
mutate(measurementType = substr(measurementCode, 1, 7),
year = substr(measurementCode, 8, 9))
enrollment3$year <- as.numeric(enrollment3$year)
enrollment4 <- enrollment3 %>%
mutate(schoolYear = if_else(year <= 22 , year + 2000, year + 1900)) %>%
select(area_name, STCOU, measurementType, schoolYear, studentsEnrolled)
enrollment4
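The parsing logic above can be checked on a few codes in isolation. The sketch below uses base R only; the three example codes are assumptions written in the same layout as the column names, not values taken from the data.

```r
# Hypothetical measurement codes in the same layout as the source columns
codes <- c("EDU010187D", "EDU010196D", "EDU010205D")

# Characters 1-7 identify the measurement, characters 8-9 hold the year
measurementType <- substr(codes, 1, 7)
year <- as.numeric(substr(codes, 8, 9))

# Two-digit years up to 22 are read as 20xx, all others as 19xx
schoolYear <- ifelse(year <= 22, year + 2000, year + 1900)

data.frame(codes, measurementType, year, schoolYear)
```

The cutoff of 22 is a design choice tied to when the analysis was written; a two-digit year can only be expanded unambiguously if the data spans less than a century.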
Running the code chunk below creates a tibble from the
enrollment4 data set containing only county data
(county) with a class called
county. To achieve this for the county level data, the
grep() function was used to look through the character
strings in area_name to find all the rows with a comma
since all county names in the data set are followed by a comma and the
two letter state abbreviation.
county <- enrollment4 %>%
slice(grep(pattern = ", \\w\\w", area_name))
county
class(county) <- c("county", class(county))
class(county)
## [1] "county" "tbl_df" "tbl" "data.frame"
Running the code chunk below creates a tibble from the
enrollment4 data set containing only non-county data
(noncounty) with a class called
state. Similar to the process used above, the
grepl() function was used to look through the character
strings in area_name and flag the rows that match the
comma pattern, and negating the result keeps only the rows without a
comma followed by a state abbreviation.
noncounty <- enrollment4 %>%
filter(!grepl(pattern = ", \\w\\w", area_name))
noncounty
class(noncounty) <- c("state", class(noncounty))
class(noncounty)
## [1] "state" "tbl_df" "tbl" "data.frame"
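The county and non-county split, together with the class assignment, can be sketched on a small character vector. The area names below are made up in the style of the data, and the object toy is a hypothetical stand-in for the real tibble.

```r
# Hypothetical area names in the same style as the data
area_name <- c("UNITED STATES", "ALABAMA", "Autauga, AL", "Baldwin, AL")

# TRUE where the name contains a comma, a space, and two word characters
is_county <- grepl(pattern = ", \\w\\w", area_name)

county_names <- area_name[is_county]
noncounty_names <- area_name[!is_county]

# Prepending a class keeps the existing classes available for dispatch
toy <- data.frame(area_name = noncounty_names)
class(toy) <- c("state", class(toy))
class(toy)
```

Because the new class is placed first, methods written for it are tried before the data frame methods, while everything that works on a data frame continues to work.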
Running the code chunk below alters the existing county
tibble to add a new variable for the state associated with the
area_name. As discussed previously, all county names in the
data set are followed by a comma and the two letter state abbreviation.
To create a state variable containing the state
abbreviations only, the nchar() function was used to count
the number of characters in the string and the substr()
function saved only the last two characters.
county <- county %>%
mutate(state = substr(area_name, nchar(area_name) - 1, nchar(area_name)))
county
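The last-two-characters approach can be verified on a few names. The sketch below uses made-up county names and also shows one possible alternative, a regular expression with sub(), which is an assumption of this example rather than part of the original workflow.

```r
# Hypothetical county names in the "County, ST" layout
area_name <- c("Autauga, AL", "Cook, IL", "Wake, NC")

# Keep only the last two characters of each string
len <- nchar(area_name)
state <- substr(area_name, len - 1, len)

# Alternative: drop everything up to and including the comma and space
state_alt <- sub(".*, ", "", area_name)

state
```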
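The U.S. Census Bureau organizes states into four regions and nine divisions. These divisions are New England, Middle Atlantic, East North Central, West North Central, South Atlantic, East South Central, West South Central, Mountain, and Pacific.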
Running the code chunk below alters the existing
noncounty tibble to add a new variable for the division
associated with the area_name. Using the Census Bureau’s list
of states by division, a vector was created for each division and
used to populate the division variable by searching the
area_name column for any state corresponding with it. Any
row with an area_name not corresponding to a division will
return ERROR in the new column.
noncounty <- noncounty %>%
mutate(division = if_else(area_name %in% c("CONNECTICUT", "MAINE", "MASSACHUSETTS", "NEW HAMPSHIRE", "RHODE ISLAND", "VERMONT"), "New England",
if_else(area_name %in% c("NEW JERSEY", "NEW YORK", "PENNSYLVANIA"), "Middle Atlantic",
if_else(area_name %in% c("INDIANA", "ILLINOIS", "MICHIGAN", "OHIO", "WISCONSIN"), "East North Central",
if_else(area_name %in% c("IOWA", "KANSAS", "MINNESOTA", "MISSOURI", "NEBRASKA", "NORTH DAKOTA", "SOUTH DAKOTA"), "West North Central",
if_else(area_name %in% c("DELAWARE", "DISTRICT OF COLUMBIA", "District of Columbia", "FLORIDA", "GEORGIA", "MARYLAND", "NORTH CAROLINA", "SOUTH CAROLINA", "VIRGINIA", "WEST VIRGINIA"), "South Atlantic",
if_else(area_name %in% c("ALABAMA", "KENTUCKY", "MISSISSIPPI", "TENNESSEE"), "East South Central",
if_else(area_name %in% c("ARKANSAS", "LOUISIANA", "OKLAHOMA", "TEXAS"), "West South Central",
if_else(area_name %in% c("ARIZONA", "COLORADO", "IDAHO", "NEW MEXICO", "MONTANA", "UTAH", "NEVADA", "WYOMING"), "Mountain",
if_else(area_name %in% c("ALASKA", "CALIFORNIA", "HAWAII", "OREGON", "WASHINGTON"), "Pacific", "ERROR"))))))))))
noncounty
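The same lookup can be written without deep nesting. The sketch below is one possible alternative using dplyr's case_when(), shown for only two of the nine divisions to keep it short. The vector names new_england and middle_atlantic and the toy tibble are assumptions for illustration.

```r
library(dplyr)

# Hypothetical subset of area names
toy <- tibble(area_name = c("MAINE", "NEW YORK", "UNITED STATES", "VERMONT"))

# Only two of the nine divisions are spelled out in this sketch
new_england <- c("CONNECTICUT", "MAINE", "MASSACHUSETTS",
                 "NEW HAMPSHIRE", "RHODE ISLAND", "VERMONT")
middle_atlantic <- c("NEW JERSEY", "NEW YORK", "PENNSYLVANIA")

toy <- toy %>%
  mutate(division = case_when(
    area_name %in% new_england ~ "New England",
    area_name %in% middle_atlantic ~ "Middle Atlantic",
    TRUE ~ "ERROR"  # fallback for names that match no division
  ))

toy
```

Conditions are evaluated in order and the first match wins, so the final TRUE branch plays the same role as the innermost "ERROR" in the nested if_else() version.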
The data set read in below is a second section of public school
enrollment data saved as a new tibble named sheet2.
sheet2 <- read_csv("https://www4.stat.ncsu.edu/~online/datasets/EDU01b.csv")
## Rows: 3198 Columns: 42
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (22): Area_name, STCOU, EDU010197N1, EDU010197N2, EDU010198N1, EDU010198...
## dbl (20): EDU010197F, EDU010197D, EDU010198F, EDU010198D, EDU010199F, EDU010...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Running the code chunk below creates a function to execute steps 1
and 2 (executeStep12). The temporary object
pivot first selects only Area_name,
STCOU, and all columns ending in “D” from the data frame
provided by the user and renames Area_name to area_name.
The tibble then gets reshaped from wide to long format. The column names
ending in “D” from the data frame are now under the
measurementCode variable, with one enrollment value in each
row. The user running the executeStep12 function has the
option to name the enrollment value column (z), but if they choose
not to do so, the default name for the column is
studentsEnrolled.
executeStep12 <- function(df, z = "studentsEnrolled"){
pivot <- df %>%
select(Area_name, STCOU, ends_with("D")) %>%
rename("area_name" = Area_name) %>%
pivot_longer(cols = 3:12, names_to = "measurementCode", values_to = z)
return(pivot)
}
step12 <- executeStep12(sheet2)
step12
Running the code chunk below creates a function to execute step 3
(executeStep3). The first temporary object
splitCode manipulates the data frame resulting from step 2
(provided by the user) by splitting the measurementCode
variable into two separate columns. The mutate() function
is used to create the new variables and the substr()
function parses the character string values from
measurementCode into a seven character
measurementType and a two character year. The
year variable is then converted to a two digit numeric
variable in order to change the two digit year into a four
digit schoolYear in the last temporary object
fullYear. All unnecessary columns are then dropped from the
new tibble.
executeStep3 <- function(df, z = "studentsEnrolled"){
splitCode <- df %>%
mutate(measurementType = substr(measurementCode, 1, 7),
year = substr(measurementCode, 8, 9))
splitCode$year <- as.numeric(splitCode$year)
fullYear <- splitCode %>%
mutate(schoolYear = if_else(year <= 22 , year + 2000, year + 1900)) %>%
select(area_name, STCOU, measurementType, schoolYear, all_of(z))
return(fullYear)
}
step3 <- executeStep3(step12)
step3
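When a column name is stored in a character variable such as z, tidyselect's all_of() makes the selection explicit and raises an error if the column is missing. The sketch below illustrates this on a made-up tibble; the helper keep_cols and the toy data are hypothetical.

```r
library(dplyr)

# Hypothetical data with a user-named enrollment column and an extra column
toy <- tibble(
  area_name  = c("A", "B"),
  schoolYear = c(1987, 1988),
  enrollment = c(10, 20),
  extra      = c(1, 2)
)

# z holds the column name as a string; all_of() looks it up explicitly
keep_cols <- function(df, z = "studentsEnrolled") {
  df %>%
    select(area_name, schoolYear, all_of(z))
}

result <- keep_cols(toy, z = "enrollment")
result
```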
Running the code chunk below creates a function to execute step 5
(executeStep5). Meant to be used on the county
tibble from step 4, the temporary object addState adds the
new state variable for the state associated with the
area_name by using the nchar() function to
count the number of characters in the string and the
substr() function to save only the last two characters in
the string containing the two letter state abbreviation.
executeStep5 <- function(df){
addState <- df %>%
mutate(state = substr(area_name, nchar(area_name) - 1, nchar(area_name)))
return(addState)
}
Running the code chunk below creates a function to execute step 6
(executeStep6). Meant to be used on the
noncounty tibble from step 4, the temporary object
addDivision adds a new variable for the division associated
with the area_name using vectors containing all the states
in each division. The %in% operator populates the
division variable by searching the area_name
column for any state corresponding with it. Any row with an
area_name not corresponding to a division will return
ERROR in the new column.
executeStep6 <- function(df){
addDivision <- df %>%
mutate(division = if_else(area_name %in% c("CONNECTICUT", "MAINE", "MASSACHUSETTS", "NEW HAMPSHIRE", "RHODE ISLAND", "VERMONT"), "New England",
if_else(area_name %in% c("NEW JERSEY", "NEW YORK", "PENNSYLVANIA"), "Middle Atlantic",
if_else(area_name %in% c("INDIANA", "ILLINOIS", "MICHIGAN", "OHIO", "WISCONSIN"), "East North Central",
if_else(area_name %in% c("IOWA", "KANSAS", "MINNESOTA", "MISSOURI", "NEBRASKA", "NORTH DAKOTA", "SOUTH DAKOTA"), "West North Central",
if_else(area_name %in% c("DELAWARE", "DISTRICT OF COLUMBIA", "District of Columbia", "FLORIDA", "GEORGIA", "MARYLAND", "NORTH CAROLINA", "SOUTH CAROLINA", "VIRGINIA", "WEST VIRGINIA"), "South Atlantic",
if_else(area_name %in% c("ALABAMA", "KENTUCKY", "MISSISSIPPI", "TENNESSEE"), "East South Central",
if_else(area_name %in% c("ARKANSAS", "LOUISIANA", "OKLAHOMA", "TEXAS"), "West South Central",
if_else(area_name %in% c("ARIZONA", "COLORADO", "IDAHO", "NEW MEXICO", "MONTANA", "UTAH", "NEVADA", "WYOMING"), "Mountain",
if_else(area_name %in% c("ALASKA", "CALIFORNIA", "HAWAII", "OREGON", "WASHINGTON"), "Pacific", "ERROR"))))))))))
return(addDivision)
}
Running the code chunk below creates a function to execute steps 4, 5
and 6 (executeStep456).
- The county object creates a tibble from the data frame resulting
from step 3 (provided by the user) containing only county data by
keeping all the rows with a comma in area_name. The class() function
prepends a new class called county to the existing class vector. The
temporary object countyStep5 runs the previously written executeStep5
function to add the state variable to the resulting countyData tibble.
- The noncounty object creates a second tibble containing only
non-county data by keeping all the rows without a comma in area_name.
The class() function prepends a new class called state to the existing
class vector. The temporary object noncountyStep6 runs the previously
written executeStep6 function to add the division variable to the
resulting noncountyData tibble.
- list() allows executeStep456 to return a list containing two
separate tibbles (countyData and noncountyData).
executeStep456 <- function(df){
county <- df %>%
slice(grep(pattern = ", \\w\\w", area_name))
class(county) <- c("county", class(county))
countyStep5 <- executeStep5(county)
noncounty <- df %>%
filter(!grepl(pattern = ", \\w\\w", area_name))
class(noncounty) <- c("state", class(noncounty))
noncountyStep6 <- executeStep6(noncounty)
return(list(countyData = countyStep5, noncountyData = noncountyStep6))
}
step456 <- executeStep456(step3)
step456
## $countyData
## # A tibble: 31,450 × 6
## area_name STCOU measurementType schoolYear studentsEnrolled state
## <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 Autauga, AL 01001 EDU0101 1997 8099 AL
## 2 Autauga, AL 01001 EDU0101 1998 8211 AL
## 3 Autauga, AL 01001 EDU0101 1999 8489 AL
## 4 Autauga, AL 01001 EDU0102 2000 8912 AL
## 5 Autauga, AL 01001 EDU0102 2001 8626 AL
## 6 Autauga, AL 01001 EDU0102 2002 8762 AL
## 7 Autauga, AL 01001 EDU0152 2003 9105 AL
## 8 Autauga, AL 01001 EDU0152 2004 9200 AL
## 9 Autauga, AL 01001 EDU0152 2005 9559 AL
## 10 Autauga, AL 01001 EDU0152 2006 9652 AL
## # … with 31,440 more rows
##
## $noncountyData
## # A tibble: 530 × 6
## area_name STCOU measurementType schoolYear studentsEnrolled division
## <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 UNITED STATES 00000 EDU0101 1997 44534459 ERROR
## 2 UNITED STATES 00000 EDU0101 1998 46245814 ERROR
## 3 UNITED STATES 00000 EDU0101 1999 46368903 ERROR
## 4 UNITED STATES 00000 EDU0102 2000 46818690 ERROR
## 5 UNITED STATES 00000 EDU0102 2001 47127066 ERROR
## 6 UNITED STATES 00000 EDU0102 2002 47606570 ERROR
## 7 UNITED STATES 00000 EDU0152 2003 48506317 ERROR
## 8 UNITED STATES 00000 EDU0152 2004 48693287 ERROR
## 9 UNITED STATES 00000 EDU0152 2005 48978555 ERROR
## 10 UNITED STATES 00000 EDU0152 2006 49140702 ERROR
## # … with 520 more rows
Running the code chunk below creates a wrapper function to execute
data processing steps 1 through 6 (my_wrapper). Provided a
URL of a .csv file from the user, the wrapper function will
read in the data set in the first temporary object data,
call the function to run steps 1 and 2 (executeStep12) in
question12, call the function to run step 3
(executeStep3) in question3, and call the
function to run the final steps (executeStep456) in
question456. Since the wrapper function ends with the
executeStep456 function, it returns a list containing two
separate tibbles (countyData and
noncountyData).
my_wrapper <- function(url, z = "studentsEnrolled"){
data <- read_csv(url)
question12 <- executeStep12(data, z)
question3 <- executeStep3(question12, z)
question456 <- executeStep456(question3)
return(question456)
}
Running the code chunk below calls the wrapper function
my_wrapper to read in and parse the two .csv
files for the first and second data sets. The resulting objects
(eduA and eduB) are two lists containing two
tibbles each (eduA$countyData,
eduA$noncountyData, eduB$countyData and
eduB$noncountyData).
eduA <- my_wrapper("https://www4.stat.ncsu.edu/~online/datasets/EDU01a.csv")
## Rows: 3198 Columns: 42
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (22): Area_name, STCOU, EDU010187N1, EDU010187N2, EDU010188N1, EDU010188...
## dbl (20): EDU010187F, EDU010187D, EDU010188F, EDU010188D, EDU010189F, EDU010...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
eduB <- my_wrapper("https://www4.stat.ncsu.edu/~online/datasets/EDU01b.csv")
## Rows: 3198 Columns: 42
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (22): Area_name, STCOU, EDU010197N1, EDU010197N2, EDU010198N1, EDU010198...
## dbl (20): EDU010197F, EDU010197D, EDU010198F, EDU010198D, EDU010199F, EDU010...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Running the code chunk below creates a function
(combineEnrollment) to combine the four tibbles resulting
from the wrapper function my_wrapper into two tibbles,
merging the two county level data sets (eduA$countyData and
eduB$countyData) and the two non-county level data sets
(eduA$noncountyData and eduB$noncountyData)
using dplyr::bind_rows(). The result is a list
containing the two tibbles.
combineEnrollment <- function(df1, df2){
countyData <- dplyr::bind_rows(df1$countyData, df2$countyData)
noncountyData <- dplyr::bind_rows(df1$noncountyData, df2$noncountyData)
return(list(countyData = countyData, noncountyData = noncountyData))
}
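A two-argument combiner of this shape can also be folded over any number of pieces with base R's Reduce(). The sketch below is a minimal illustration under stated assumptions: combine_two is a hypothetical stand-in with the same shape as combineEnrollment, and make_piece builds made-up one-row pieces.

```r
library(dplyr)

# Hypothetical stand-in with the same shape as combineEnrollment()
combine_two <- function(x, y) {
  list(countyData    = bind_rows(x$countyData, y$countyData),
       noncountyData = bind_rows(x$noncountyData, y$noncountyData))
}

# Build a made-up one-row piece for a given year
make_piece <- function(yr) {
  list(countyData    = tibble(area_name = "Autauga, AL", schoolYear = yr),
       noncountyData = tibble(area_name = "ALABAMA", schoolYear = yr))
}

pieces <- lapply(c(1987, 1997, 2007), make_piece)

# Reduce() applies the combiner pairwise from left to right
combined <- Reduce(combine_two, pieces)
combined
```

This avoids naming intermediate objects when more than two data sets need to be stacked.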
Running the code chunk below calls the combine function
combineEnrollment to combine eduA and
eduB into one list (eduAB) containing two
tibbles corresponding to county level data (countyData) and
non-county level data (noncountyData).
eduAB <- combineEnrollment(eduA, eduB)
eduAB
## $countyData
## # A tibble: 62,900 × 6
## area_name STCOU measurementType schoolYear studentsEnrolled state
## <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 Autauga, AL 01001 EDU0101 1987 6829 AL
## 2 Autauga, AL 01001 EDU0101 1988 6900 AL
## 3 Autauga, AL 01001 EDU0101 1989 6920 AL
## 4 Autauga, AL 01001 EDU0101 1990 6847 AL
## 5 Autauga, AL 01001 EDU0101 1991 7008 AL
## 6 Autauga, AL 01001 EDU0101 1992 7137 AL
## 7 Autauga, AL 01001 EDU0101 1993 7152 AL
## 8 Autauga, AL 01001 EDU0101 1994 7381 AL
## 9 Autauga, AL 01001 EDU0101 1995 7568 AL
## 10 Autauga, AL 01001 EDU0101 1996 7834 AL
## # … with 62,890 more rows
##
## $noncountyData
## # A tibble: 1,060 × 6
## area_name STCOU measurementType schoolYear studentsEnrolled division
## <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 UNITED STATES 00000 EDU0101 1987 40024299 ERROR
## 2 UNITED STATES 00000 EDU0101 1988 39967624 ERROR
## 3 UNITED STATES 00000 EDU0101 1989 40317775 ERROR
## 4 UNITED STATES 00000 EDU0101 1990 40737600 ERROR
## 5 UNITED STATES 00000 EDU0101 1991 41385442 ERROR
## 6 UNITED STATES 00000 EDU0101 1992 42088151 ERROR
## 7 UNITED STATES 00000 EDU0101 1993 42724710 ERROR
## 8 UNITED STATES 00000 EDU0101 1994 43369917 ERROR
## 9 UNITED STATES 00000 EDU0101 1995 43993459 ERROR
## 10 UNITED STATES 00000 EDU0101 1996 44715737 ERROR
## # … with 1,050 more rows
Running the code chunk below writes a custom plot
function (plot.state) for the state
class added to the non-county level data in step 4. The temporary object
avgEnroll first filters out any observations not
corresponding to a division, which step 6 marks with
ERROR in the division variable. The
data is then grouped by division and
schoolYear, and a new variable avgEnrollment
holds the mean of the enrollment statistic (named
studentsEnrolled by default) for each division
in each year. Since the name of the enrollment statistic is supplied by
the user as the character string z rather than as a column
reference, the get() function is needed within
mean() to look the column up by name. The function plots a line
graph of the mean of the enrollment statistic (named
studentsEnrolled by default) for each division
per year observed (schoolYear).
plot.state <- function(df, z = "studentsEnrolled"){
avgEnroll <- df %>%
filter(division != "ERROR") %>%
group_by(division, schoolYear) %>%
summarise(avgEnrollment = mean(get(z)))
ggplot(avgEnroll, aes(x = schoolYear, y = avgEnrollment, color = division)) +
geom_line()
}
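As one possible alternative to get(), dplyr's .data pronoun can look up a column from a string inside summarise(). The sketch below also sets .groups = "drop" so the result is ungrouped and no grouping message is printed. The function name avg_by_division and the toy values are assumptions for illustration.

```r
library(dplyr)

# Hypothetical data: two areas per division in a single year
toy <- tibble(
  division   = c("Pacific", "Pacific", "Mountain", "Mountain"),
  schoolYear = c(1987, 1987, 1987, 1987),
  enrollment = c(100, 300, 50, 150)
)

# z names the column to average; .data[[z]] retrieves it by name
avg_by_division <- function(df, z = "studentsEnrolled") {
  df %>%
    group_by(division, schoolYear) %>%
    summarise(avgEnrollment = mean(.data[[z]]), .groups = "drop")
}

result <- avg_by_division(toy, z = "enrollment")
result
```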
Running the code chunk below writes a custom plot
function (plot.county) for the county
class added to the county level data in step 4. In addition to the
data frame, the function takes four arguments:
- z: the name of the enrollment statistic (studentsEnrolled by default)
- st: the state of interest (‘IL’ by default)
- org: the ordering of the data (‘top’ by default, indicating to
organize data from largest to smallest)
- n: the number of counties to plot (default value of 5)
The function works in three stages:
- filterData first subsets the data to only include observations from
the state indicated (st). The data is then grouped by area_name, and a
new variable avgEnrollment holds the mean value of the enrollment
statistic for each county.
- orderData sorts the data according to the user’s choice of
organization org, keeps the number of rows given by the user defined n,
and selects only the resulting area_name column. The result is later
used to subset the original tibble for plotting.
  - If ‘top’ is specified for org by the user, the data frame will be
  organized from largest to smallest and select the first n county names.
  - If ‘bottom’ is specified for org by the user, the data frame will be
  organized from smallest to largest and select the first n county names.
  - If anything else is specified for org by the user, the stop()
  function prints an error message.
- filterOrder subsets the original data frame by orderData to select
only observations specified by the user.
The function plots a line graph of the enrollment statistic for each county fitting the user specification across the years.
plot.county <- function(df, z = "studentsEnrolled", st = "IL", org = "top", n = 5){
filterData <- df %>%
filter(state == st) %>%
group_by(area_name) %>%
summarise(avgEnrollment = mean(get(z)))
orderData <- if(org == "top") {
arrange(filterData, desc(avgEnrollment)) %>%
top_n(n) %>%
select(area_name)
} else if(org == "bottom") {
arrange(filterData, avgEnrollment) %>%
top_n(-n) %>% # negative n keeps the n smallest averages
select(area_name)
} else {
stop("Must specify organizational preference (org)")
}
filterOrder <- df[df$area_name %in% orderData$area_name, ]
ggplot(filterOrder, aes(x = schoolYear, y = get(z), color = area_name)) +
geom_line()
}
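Note that top_n() chooses rows by the value of the weighting column (the last column when none is given), not by the current row order, so arranging the data first does not change which rows it keeps. dplyr's slice_max() and slice_min() state the direction explicitly. The sketch below compares them on a made-up tibble; the object names and values are assumptions for illustration.

```r
library(dplyr)

# Hypothetical county averages
toy <- tibble(
  area_name     = c("A", "B", "C", "D"),
  avgEnrollment = c(40, 10, 30, 20)
)

# Two largest averages, returned in descending order
largest <- toy %>% slice_max(avgEnrollment, n = 2)

# Two smallest averages, returned in ascending order
smallest <- toy %>% slice_min(avgEnrollment, n = 2)

largest
smallest
```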
Running the code chunk below calls the wrapper function
my_wrapper to read in and parse the two .csv
files for the first and second data sets. The variable name for the
enrollment statistic has been changed to enrollment. The
resulting objects (edu01A and edu01B) are two
lists containing two tibbles each (edu01A$countyData,
edu01A$noncountyData, edu01B$countyData and
edu01B$noncountyData).
edu01A <- my_wrapper("https://www4.stat.ncsu.edu/~online/datasets/EDU01a.csv", z = "enrollment")
## Rows: 3198 Columns: 42
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (22): Area_name, STCOU, EDU010187N1, EDU010187N2, EDU010188N1, EDU010188...
## dbl (20): EDU010187F, EDU010187D, EDU010188F, EDU010188D, EDU010189F, EDU010...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
edu01B <- my_wrapper("https://www4.stat.ncsu.edu/~online/datasets/EDU01b.csv", z = "enrollment")
## Rows: 3198 Columns: 42
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (22): Area_name, STCOU, EDU010197N1, EDU010197N2, EDU010198N1, EDU010198...
## dbl (20): EDU010197F, EDU010197D, EDU010198F, EDU010198D, EDU010199F, EDU010...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Running the code chunk below calls the combine function
combineEnrollment to combine edu01A and
edu01B into one list (edu01AB) containing two
tibbles corresponding to county level data (countyData) and
non-county level data (noncountyData).
edu01AB <- combineEnrollment(edu01A, edu01B)
edu01AB
## $countyData
## # A tibble: 62,900 × 6
## area_name STCOU measurementType schoolYear enrollment state
## <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 Autauga, AL 01001 EDU0101 1987 6829 AL
## 2 Autauga, AL 01001 EDU0101 1988 6900 AL
## 3 Autauga, AL 01001 EDU0101 1989 6920 AL
## 4 Autauga, AL 01001 EDU0101 1990 6847 AL
## 5 Autauga, AL 01001 EDU0101 1991 7008 AL
## 6 Autauga, AL 01001 EDU0101 1992 7137 AL
## 7 Autauga, AL 01001 EDU0101 1993 7152 AL
## 8 Autauga, AL 01001 EDU0101 1994 7381 AL
## 9 Autauga, AL 01001 EDU0101 1995 7568 AL
## 10 Autauga, AL 01001 EDU0101 1996 7834 AL
## # … with 62,890 more rows
##
## $noncountyData
## # A tibble: 1,060 × 6
## area_name STCOU measurementType schoolYear enrollment division
## <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 UNITED STATES 00000 EDU0101 1987 40024299 ERROR
## 2 UNITED STATES 00000 EDU0101 1988 39967624 ERROR
## 3 UNITED STATES 00000 EDU0101 1989 40317775 ERROR
## 4 UNITED STATES 00000 EDU0101 1990 40737600 ERROR
## 5 UNITED STATES 00000 EDU0101 1991 41385442 ERROR
## 6 UNITED STATES 00000 EDU0101 1992 42088151 ERROR
## 7 UNITED STATES 00000 EDU0101 1993 42724710 ERROR
## 8 UNITED STATES 00000 EDU0101 1994 43369917 ERROR
## 9 UNITED STATES 00000 EDU0101 1995 43993459 ERROR
## 10 UNITED STATES 00000 EDU0101 1996 44715737 ERROR
## # … with 1,050 more rows
Running the code chunk below calls the custom plot.state
function to plot a line graph of the mean of the enrollment statistic
(avgEnrollment) for each division per year
observed (schoolYear).
plot(edu01AB$noncountyData, z = "enrollment")
## `summarise()` has grouped output by 'division'. You can override using the
## `.groups` argument.
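The call above reaches plot.state because plot() is an S3 generic: R looks for a method matching each element of the object's class vector in order. The sketch below demonstrates the same mechanism with a made-up generic named describe, which is hypothetical and used only to avoid drawing a plot.

```r
# Hypothetical generic and methods, mirroring how plot() finds plot.state
describe <- function(x, ...) UseMethod("describe")
describe.state <- function(x, ...) "state-level method"
describe.county <- function(x, ...) "county-level method"
describe.default <- function(x, ...) "default method"

# Objects whose first class decides which method runs
s <- structure(data.frame(a = 1), class = c("state", "data.frame"))
k <- structure(data.frame(a = 1), class = c("county", "data.frame"))
d <- data.frame(a = 1)

describe(s)
describe(k)
describe(d)
```

An object without either custom class falls through to the default method, just as plot() falls back to its standard behavior for an ordinary data frame.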
Running the code chunk below calls the custom
plot.county function to plot a line graph of the enrollment
statistic (enrollment) for the 7 (n) counties
(area_name) with the largest (org) enrollment
values in Pennsylvania (st) across the years
(schoolYear).
plot(edu01AB$countyData, z = "enrollment", st = "PA", org = "top", n = 7)
## Selecting by avgEnrollment
Running the code chunk below calls the custom
plot.county function to plot a line graph of the enrollment
statistic (enrollment) for the 4 (n) counties
(area_name) with the smallest (org) enrollment
values in Pennsylvania (st) across the years
(schoolYear).
plot(edu01AB$countyData, z = "enrollment", st = "PA", org = "bottom", n = 4)
## Selecting by avgEnrollment
Running the code chunk below calls the custom
plot.county function to plot a line graph of the enrollment
statistic (enrollment) for the 5 (n) counties
(area_name) with the largest (org) enrollment
values in Illinois (st) across the years
(schoolYear).
plot(edu01AB$countyData, z = "enrollment")
## Selecting by avgEnrollment
Running the code chunk below calls the custom
plot.county function to plot a line graph of the enrollment
statistic (enrollment) for the 10 (n) counties
(area_name) with the largest (org) enrollment
values in Minnesota (st) across the years
(schoolYear).
plot(edu01AB$countyData, z = "enrollment", st = "MN", org = "top", n = 10)
## Selecting by avgEnrollment
Running the code chunk below calls the wrapper function
my_wrapper to read in and parse the four .csv
files for the last four data sets. The resulting objects
(pstA, pstB, pstC, and
pstD) are four lists containing two tibbles each
(pstA$countyData, pstA$noncountyData,
pstB$countyData, pstB$noncountyData,
pstC$countyData, pstC$noncountyData,
pstD$countyData and pstD$noncountyData).
pstA <- my_wrapper("https://www4.stat.ncsu.edu/~online/datasets/PST01a.csv")
## Rows: 3198 Columns: 42
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (22): Area_name, STCOU, PST015171N1, PST015171N2, PST015172N1, PST015172...
## dbl (20): PST015171F, PST015171D, PST015172F, PST015172D, PST015173F, PST015...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pstB <- my_wrapper("https://www4.stat.ncsu.edu/~online/datasets/PST01b.csv")
## Rows: 3198 Columns: 42
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (22): Area_name, STCOU, PST025182N1, PST025182N2, PST025183N1, PST025183...
## dbl (20): PST025182F, PST025182D, PST025183F, PST025183D, PST025184F, PST025...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pstC <- my_wrapper("https://www4.stat.ncsu.edu/~online/datasets/PST01c.csv")
## Rows: 3198 Columns: 42
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (22): Area_name, STCOU, PST035191N1, PST035191N2, PST035192N1, PST035192...
## dbl (20): PST035191F, PST035191D, PST035192F, PST035192D, PST035193F, PST035...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pstD <- my_wrapper("https://www4.stat.ncsu.edu/~online/datasets/PST01d.csv")
## Rows: 3198 Columns: 42
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (22): Area_name, STCOU, PST045200N1, PST045200N2, PST045201N1, PST045201...
## dbl (20): PST045200F, PST045200D, PST045201F, PST045201D, PST045202F, PST045...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Running the code chunk below calls the combine function
combineEnrollment three times to combine pstA,
pstB, pstC, and pstD into one
list (pstABCD) containing two tibbles corresponding to county
level data (countyData) and non-county level data
(noncountyData).
pstAB <- combineEnrollment(pstA, pstB)
pstCD <- combineEnrollment(pstC, pstD)
pstABCD <- combineEnrollment(pstAB, pstCD)
pstABCD
## $countyData
## # A tibble: 125,800 × 6
## area_name STCOU measurementType schoolYear studentsEnrolled state
## <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 Autauga, AL 01001 PST0151 1971 25508 AL
## 2 Autauga, AL 01001 PST0151 1972 27166 AL
## 3 Autauga, AL 01001 PST0151 1973 28463 AL
## 4 Autauga, AL 01001 PST0151 1974 29266 AL
## 5 Autauga, AL 01001 PST0151 1975 29718 AL
## 6 Autauga, AL 01001 PST0151 1976 29896 AL
## 7 Autauga, AL 01001 PST0151 1977 30462 AL
## 8 Autauga, AL 01001 PST0151 1978 30882 AL
## 9 Autauga, AL 01001 PST0151 1979 32055 AL
## 10 Autauga, AL 01001 PST0251 1981 31985 AL
## # … with 125,790 more rows
##
## $noncountyData
## # A tibble: 2,120 × 6
## area_name STCOU measurementType schoolYear studentsEnrolled division
## <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 UNITED STATES 00000 PST0151 1971 206827028 ERROR
## 2 UNITED STATES 00000 PST0151 1972 209283904 ERROR
## 3 UNITED STATES 00000 PST0151 1973 211357490 ERROR
## 4 UNITED STATES 00000 PST0151 1974 213341552 ERROR
## 5 UNITED STATES 00000 PST0151 1975 215465246 ERROR
## 6 UNITED STATES 00000 PST0151 1976 217562728 ERROR
## 7 UNITED STATES 00000 PST0151 1977 219759860 ERROR
## 8 UNITED STATES 00000 PST0151 1978 222095080 ERROR
## 9 UNITED STATES 00000 PST0151 1979 224567234 ERROR
## 10 UNITED STATES 00000 PST0251 1981 229466391 ERROR
## # … with 2,110 more rows
Running the code chunk below calls the custom plot.state
function to plot a line graph of the mean of the enrollment statistic
(avgEnrollment) for each division per year
observed (schoolYear) in the noncountyData tibble of the
pstABCD list.
plot(pstABCD$noncountyData)
## `summarise()` has grouped output by 'division'. You can override using the
## `.groups` argument.
Running the code chunk below calls the custom
plot.county function to plot a line graph of the enrollment
statistic (studentsEnrolled) for the 6 (n)
counties (area_name) with the largest (org)
enrollment values in Connecticut (st) across the years
(schoolYear).
plot(pstABCD$countyData, st = "CT", org = "top", n = 6)
## Selecting by avgEnrollment
Running the code chunk below calls the custom
plot.county function to plot a line graph of the enrollment
statistic (studentsEnrolled) for the 10 (n)
counties (area_name) with the smallest (org)
enrollment values in North Carolina (st) across the years
(schoolYear).
plot(pstABCD$countyData, st = "NC", org = "bottom", n = 10)
## Selecting by avgEnrollment
Running the code chunk below calls the custom
plot.county function to plot a line graph of the enrollment
statistic (studentsEnrolled) for the 5 (n)
counties (area_name) with the largest (org)
enrollment values in Illinois (st) across the years
(schoolYear).
plot(pstABCD$countyData)
## Selecting by avgEnrollment
Running the code chunk below calls the custom
plot.county function to plot a line graph of the enrollment
statistic (studentsEnrolled) for the 4 (n)
counties (area_name) with the largest (org)
enrollment values in Minnesota (st) across the years
(schoolYear).
plot(pstABCD$countyData, st = "MN", n = 4)
## Selecting by avgEnrollment